Abstract

Multiple instance learning (MIL) has recently attracted great attention in the machine learning
community. However, most MIL algorithms are very slow and cannot be applied to large datasets. In this
paper, we propose a greedy strategy to speed up the multiple instance learning process. Our contribution
is twofold. First, we propose a density ratio model, and show that maximizing a density ratio function is
the lower bound of the Diverse Density (DD) model under certain conditions. Second, we use a histogram
ratio between positive bags and negative bags to represent the density ratio function, and find codebooks
separately for positive bags and negative bags by a greedy strategy. For testing, we use a nearest neighbor
strategy to classify new bags. We test our method on both small benchmark datasets and the large
TRECVID MED11 dataset. The experimental results show that our method yields accuracy comparable
to the current state of the art, while being up to an order of magnitude or more faster.

1 Introduction 

Traditional supervised learning methods require a training dataset, consisting of input and label pairs, to
construct a classifier that can predict outputs/labels for novel inputs. However, the requirement of input/label
pairs in the training data is surprisingly prohibitive, especially for large training data. Multiple instance
learning (MIL) is a more flexible paradigm that learns a concept given positive and negative bags of instances.
It assumes each bag may contain many instances, but a bag is labeled positive even if only one of the instances
in it falls within the concept. A bag is labeled negative only when all instances in it are negative.
The aim of MIL is to induce a concept that will classify bags of instances based on the assumption defined
above. This paradigm has been receiving much attention in the last several years, and has many useful
applications in a number of fields, including drug activity prediction [1], stock selection [2], object detection
[3] and tracking [4], text categorization [5] [6] and image categorization [2] [7].

Although MIL has received an increasing amount of attention in recent years, the problem is still fairly
undeveloped and there are many interesting open questions. The MIL problem is harder than traditional
supervised learning because the learner receives a set of bags instead of a set of instances that are
labeled positive or negative. Moreover, the learner needs to deal with the false positives in positive bags.
Recent research [8] also shows that the problem of finding halfspaces for MIL is NP-complete. In addition,
most current MIL algorithms are still very slow and cannot be applied to large datasets. Thus, an approximate
algorithm that efficiently finds an implicit or explicit decision boundary is vital to solving the MIL problem.

In this paper, we propose a greedy multiple instance learning method (GMIL) via codebook learning and
nearest neighbor voting. Our approach is inspired by the definition of training bags, as well as by the Diverse
Density (DD) [2] and Citation kNN [9] models. If there is one true concept t in the positive bags, it must be the
intersection of all positive bags and it must be far from the negative bags. In other words, it has higher density
on positive bags and lower density on negative bags. Thus, we present the density ratio model and derive
the relationship between the DD model and our method. Instead of maximizing a likelihood function, we
maximize a density ratio between positive bags and negative bags. Then, we take a greedy strategy to select
codebooks for positive targets and negative candidates separately by sorting the histogram ratio between
positive bags and negative bags. Our algorithm is very fast: its training time is linear in the total number of
instances from all training bags. See Fig. 1 for the training process. For classification, we take a nearest
neighbor strategy. We test our method on benchmark datasets and the TRECVID MED11 dataset, and our
results demonstrate that our method is comparable to the state of the art.

Figure 1: An overview of our training process to find targets of positive bags and negative candidate centers
of negative bags. First, for all instances, we reduce the high-dimensional feature space to a lower dimension
using PCA (an optional step); then we cluster all instances from both positive bags and negative bags
into K clusters. Next, we count instances and bin them into the K centers for positive bags and negative bags
separately. Lastly, by computing the histogram ratio between positive bins and negative bins, we rank each
bin to discover target codebooks and negative candidate centers. These discriminative codebooks can be
used to classify new bags through a nearest neighbor strategy.

The rest of the paper is structured as follows. In Section 2, we survey the related work. We describe the
greedy multiple instance learning algorithm in Section 3. Details of the experimental setup and the results
of the experiments are given in Section 4, and we conclude in Section 5.



2 Related work 

One of the earliest algorithms for learning from multiple instances was developed by Dietterich et al. [1] for 
drug activity prediction. Their algorithm, the axis-parallel rectangle (APR) method, expands or shrinks a 
hyper-rectangle in the instance feature space with the goal of finding the smallest box that covers at least 
one instance from each positive bag and no instances from any negative bag. Following this seminal work, 
there has been a significant amount of research devoted to MIL problems using different learning models, 
such as DD [2], EM-DD [10], and extended Citation kNN [9]. 

Due to the success of the SVM algorithm [5], and the various positive theoretical results behind it, maximum
margin methods have become extremely popular in machine learning. Moreover, to improve classification
accuracy, many variations of SVM have been proposed by changing the constraints, the objective function,
the space projection, the kernel, etc. Andrews et al. [5] were the first to combine MIL with SVM.
Later methods focus more on applying discriminative models to MIL problems, such as SVM [11] [12]
[13] [14] [15] [16], neural networks [17] [18], boosting [19] [4], regression for feature selection [6] [20], decision
trees [21], mixtures of Gaussians [22], Gaussian processes [23], conditional random fields [24] and manifold
learning [25]. To our knowledge, most work on MIL has paid little attention to efficiency or to testing MIL
approaches on large datasets. There is also no work on learning codebooks for both positive bags and
negative bags. As mentioned in [2], the assumption that all bags intersect at a single point is not necessary;
DD can assume more complicated concepts. However, it was not generalized to learn negative targets to deal
with false positives. Recently, discriminative dictionary learning [26] has greatly improved accuracy for visual
object recognition. Thus, learning a good dictionary is vital for a discriminative approach. We argue that
learning discriminative targets, including both positive and negative centers, can yield a better representation
of the training data. But how should discriminative clusters be chosen for MIL problems? In this paper, we
propose a greedy multiple instance learning method via codebook learning and nearest neighbor voting. We
introduce the density ratio function, which is the lower bound of the DD model. We also generalize the DD
model to learn both positive and negative centers, which helps reduce the false positives in classification.
Moreover, our method is very fast to train: its training time is linear in the number of all instances, and it
can be applied to large datasets.



3 Greedy multiple instance learning 

In this section, we present a greedy strategy for multiple instance learning. We maximize the density
(histogram) ratio between positive bags and negative bags to find codebooks for both targets and negative
candidate centers, which can be used to classify novel bags. The notation used in this paper for bags and 
instances is the one introduced by Maron and Lozano-Perez [2]. 



3.1 Density ratio function 

Let D be the labeled data, which consists of a set of positive bags and negative bags. We denote all positive
bags as $B^+$ and all negative bags as $B^-$. We denote $B_i^+$ as the $i$-th positive bag, the $j$-th point
(instance) in that bag as $B_{ij}^+$, and the value of the $k$-th feature of that point as $B_{ijk}^+$. Likewise,
$B_{ij}^-$ represents a negative point. For training data $D = \{B_1^+, \dots, B_n^+, B_1^-, \dots, B_m^-\}$,
assume for now that the true concept is a single point $t$. Then (using Bayes' rule and assuming i.i.d.
observations and a uniform prior over the concept location), finding the true concept is equivalent to finding
a point that has high density on positive bags and low density on negative bags. As a result, we maximize
the following density ratio:

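$$\arg\max_x \frac{\sum_{i=1}^{n} \sum_j \Pr(x = t \mid B_{ij}^+)}{\sum_{i=1}^{m} \sum_j \Pr(x = t \mid B_{ij}^-)} \qquad (1)$$

Then, we can employ a kernel density estimation strategy [27] to calculate the probability at the point x, and
rewrite Eq. (1) as follows:

$$\arg\max_x \frac{\sum_{i=1}^{n} \sum_j \varphi\left(\frac{x - t}{h_n} \mid B_{ij}^+\right)}{\sum_{i=1}^{m} \sum_j \varphi\left(\frac{x - t}{h_n} \mid B_{ij}^-\right)} \qquad (2)$$

where $h_n$ is the edge length of a $d$-dimensional hypercube and $\varphi$ is a Parzen window function defined
on a $d$-dimensional vector $u$:

$$\varphi(u) = \begin{cases} 1 & |u_j| \le \frac{1}{2},\ j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$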



Note that we ignore the constants in Eq. (2). We also argue that we can find the true concept by maximizing
Eq. (2). Intuitively, if one of the instances in a positive bag is close to $t$, then
$\varphi(\frac{x - t}{h_n} \mid B_{ij}^+) \approx 1$ and $\varphi(\frac{x - t}{h_n} \mid B_{ij}^-) \approx 0$.
Further, if every positive bag has an instance $x$ near $t$ and no negative bags are close to $t$, then $x$ will
have high density in Eq. (2). Let us consider the case where every $x \in B_i^+$ is not close to
the true concept $t$: then the numerator will be approximately zero, but the denominator will be larger than
zero because $x$ must fall into negative bags. If $x$ is close to the target, the denominator will be
approximately zero, while the numerator will approximate the total number of positive bags because each of
them contains at least one positive instance. Hence, Eq. (2) can help find the target hypothesis $t$.
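
The behaviour described above can be illustrated with a minimal Python sketch. It assumes one particular reading of Eq. (2): the hypercube window is centered at the candidate point x and counts the instances that fall inside it. The function names, the toy interface and the guard `eps` are hypothetical and are not part of the original method.

```python
import numpy as np

def parzen_window(u):
    """Hypercube Parzen window: 1 if every coordinate satisfies |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def density_ratio(x, pos_bags, neg_bags, h, eps=1e-12):
    """Window count around x over positive instances divided by the count over negatives.

    pos_bags / neg_bags: lists of (n_i, d) arrays, one array per bag.
    h: hypercube edge length (h_n in the text).
    eps: hypothetical guard against a zero denominator (an assumption of this sketch).
    """
    pos = sum(parzen_window((bag - x) / h).sum() for bag in pos_bags)
    neg = sum(parzen_window((bag - x) / h).sum() for bag in neg_bags)
    return pos / (neg + eps)
```

A candidate point that sits near one instance of every positive bag and far from all negative instances gets a large ratio, while a point surrounded by negative instances gets a small one.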

However, it is always difficult to define the edge length $h_n$ of a $d$-dimensional hypercube. In
addition, maximizing Eq. (2) is a tough problem because it is not a continuous and differentiable function.
We can use a brute force search strategy to find the target, but it is time-consuming. Because all positive
bags contain at least one instance x close to t, and no negative bags are close to t, we can make use of
clustering methods to allocate all instances into K bins. Then, we can approximately calculate Eq. (2) by
maximizing the ratio between positive and negative histograms.

We make use of K-means to cluster all instances into K clusters, denoted as $C_1, C_2, \dots, C_K$. For positive
bags, we count the instances (frequency) falling into each of the K bins:

$$bin(k)^+ = \#(B_{ij}^+ \in C_k), \quad k = 1, \dots, K \qquad (3)$$

By normalizing over all K bins, we get the histogram for $B^+$ as
$h(k)^+ = \frac{bin(k)^+}{\sum_{k'=1}^{K} bin(k')^+}$.
Likewise, we can count the instances falling into the K bins for negative bags:

$$bin(k)^- = \#(B_{ij}^- \in C_k), \quad k = 1, \dots, K \qquad (4)$$

Similarly, we can calculate the histogram for $B^-$ as
$h(k)^- = \frac{bin(k)^-}{\sum_{k'=1}^{K} bin(k')^-}$, $k = 1, \dots, K$.
Thus, we can make use of the following formula to approximate Eq. (2):

$$\arg\max_x \frac{\sum_{i=1}^{n} \sum_j \varphi\left(\frac{x - t}{h_n} \mid B_{ij}^+\right)}{\sum_{i=1}^{m} \sum_j \varphi\left(\frac{x - t}{h_n} \mid B_{ij}^-\right)} \propto \arg\max_k \frac{h(k)^+}{h(k)^-} \qquad (5)$$

where $k = 1, \dots, K$.
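
The counting and normalization steps of Eqs. (3) to (5) can be sketched in a few lines of Python. The sketch assumes the cluster index of every instance has already been computed by some clustering routine; the function name and the guard `eps` (the small positive constant that Section 3.2 introduces for empty negative bins) are illustrative choices, not the authors' code.

```python
import numpy as np

def histogram_ratio(pos_assign, neg_assign, K, eps=1e-6):
    """Per-bin histogram ratio between positive and negative instances.

    pos_assign / neg_assign: integer cluster indices (0..K-1) of all instances
    that come from positive / negative bags.
    """
    bin_pos = np.bincount(pos_assign, minlength=K).astype(float)  # Eq. (3)
    bin_neg = np.bincount(neg_assign, minlength=K).astype(float)  # Eq. (4)
    h_pos = bin_pos / bin_pos.sum()        # normalized histogram for B+
    h_neg = bin_neg / bin_neg.sum()        # normalized histogram for B-
    ratio = h_pos / (h_neg + eps)          # Eq. (5), guarded against empty bins
    return bin_pos, bin_neg, h_pos, h_neg, ratio
```

The bin with the largest ratio is the one where positive instances concentrate while negative instances are rare, which is the discrete counterpart of maximizing Eq. (2).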



3.2 Target codebook discovery 

It is possible that the denominator in Eq. (5) is equal to 0. We introduce a small positive constant $\epsilon$
to avoid such a situation. In addition, we want bins with a larger $bin(k)^+$ to have higher priority when
choosing target centers, so we introduce the sigmoid function as a weight and reformulate Eq. (5) as follows:

$$\arg\max_k \frac{h(k)^+}{h(k)^- + \epsilon} \times \sigma\left(\frac{bin(k)^+ - n/K}{n/K}\right) \qquad (6)$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$, and $n$ is the total number of positive bags $B^+$. Recall that $n/K$ is
just the average number of positive instances in each bin if there is only one positive instance in each positive
bag. Roughly speaking, if x is the intersection of n bags, the corresponding instances should aggregate into one
bin, and that bin's frequency should be larger than $n/K$. Thus, we want to increase the weights of bins that
contain more instances from positive bags. For one target point, it is straightforward to choose the k that
maximizes Eq. (6). Furthermore, rather than having just one target point t, it is also straightforward to select
the second center (or bin) with the next largest histogram ratio. To find more target candidates, one can sort
the bins by their ratio in Eq. (6) from largest to smallest and then greedily select the top p centers as the target
codebooks, denoted as $C^+ = \{C_i^+ \mid i = 1, \dots, p\}$, where $p \ge 1$. Note that this greedy strategy for
searching codebooks ensures that our targets appear with higher probability in positive bags and lower
probability in negative bags.
As for negative candidate centers, we take a different strategy. Since all instances from negative
bags are negative, the most populated cluster centers in Eq. (4) can be used as the codebooks for
all negative instances. In this paper, we sort the negative histogram $h(k)^-$, $k = 1, \dots, K$, in descending
order and choose the first q centers as the negative representatives, denoted as
$C^- = \{C_i^- \mid i = 1, \dots, q\}$, where $q \ge 0$. In a sense, our approach generalizes the DD model to
learning both positive and negative targets.
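
The greedy selection of both codebooks can be sketched as follows. This is a minimal illustration under stated assumptions: the inputs are the per-bin counts of Eqs. (3) and (4), the score is the sigmoid-weighted ratio of Eq. (6), and the function name, argument names and the default `eps` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_codebooks(bin_pos, bin_neg, n_pos_bags, p, q, eps=1e-6):
    """Return the indices of the p target bins and the q negative candidate bins."""
    bin_pos = np.asarray(bin_pos, dtype=float)
    bin_neg = np.asarray(bin_neg, dtype=float)
    K = len(bin_pos)
    h_pos = bin_pos / bin_pos.sum()
    h_neg = bin_neg / bin_neg.sum()
    avg = n_pos_bags / K                                   # n/K in Eq. (6)
    score = h_pos / (h_neg + eps) * sigmoid((bin_pos - avg) / avg)
    pos_idx = np.argsort(-score, kind="stable")[:p]        # largest scores first
    neg_idx = np.argsort(-h_neg, kind="stable")[:q]        # most populated negative bins
    return pos_idx, neg_idx
```

The returned indices refer to cluster centers, so the codebooks themselves are obtained by indexing the array of K-means centers with `pos_idx` and `neg_idx`.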

3.3 Nearest neighbor for classification 

Assuming we have learned data centers: C + with p positive centers and C~ with q negative centers, we 
employ nearest neighbor to classify new bags. For each new bag, we calculate its distance to both p positive 
centers and q negative centers, and find the minimum Hausdorff distance [28] as the distance between the 
bag and the learned codebooks. 

Let us consider two situations below. (1) Assume every positive bag shares only one target, namely p = 1
and q = 0. In this situation, we set a threshold r to measure "closeness" for a new unlabeled bag B. In
this paper, we define r as the mean distance from the input bags to the target. Such a thresholding strategy is
similar to the probability threshold in the DD model. (2) It is not necessary that all bags intersect at a single
point. We can assume more complicated concepts, for example, p > 1 and q > 1. For a new bag B
with instances $x = \{x_i, i = 1, \dots, n\}$, we label the bag

$$label(B) = \begin{cases} +1 & dist(x, C^+) < dist(x, C^-) \\ -1 & \text{otherwise} \end{cases} \qquad (7)$$

where $dist(x, y)$ is the minimum Hausdorff distance between two sets x and y, and $C^+$ and $C^-$ are the
positive and negative centers respectively.
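
A minimal Python sketch of this nearest neighbor step is given below. It assumes that a bag is labeled positive when it lies strictly closer to the positive centers than to the negative ones, and that the threshold r is used only when no negative centers are available; the tie-breaking rule and all names are assumptions of this sketch.

```python
import numpy as np

def min_hausdorff(bag, centers):
    """Minimal Hausdorff distance: smallest pairwise Euclidean distance between two point sets."""
    diffs = bag[:, None, :] - centers[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()

def classify_bag(bag, pos_centers, neg_centers=None, threshold=None):
    """Return +1 or -1 for a bag given the learned codebooks."""
    d_pos = min_hausdorff(bag, pos_centers)
    if neg_centers is None or len(neg_centers) == 0:
        # Situation (1): only positive targets, so compare against the threshold r.
        return 1 if d_pos <= threshold else -1
    # Situation (2): compare the distances to the positive and negative codebooks.
    d_neg = min_hausdorff(bag, neg_centers)
    return 1 if d_pos < d_neg else -1
```

Because only the closest instance of the bag matters, a single instance near a target is enough to label the whole bag positive, which mirrors the MIL assumption.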

We prefer kNN to SVM, although kNN behaves similarly to SVM (both need to learn an
explicit or implicit decision boundary from training data). Since we have learned discriminative centers, we
can directly apply kNN for classification. Additionally, kNN has a good Bayesian bound: the error rate of a
kNN classifier theoretically tends to the Bayes optimal as the sample size tends to infinity.

3.4 Relationship with Diverse Density model 

For the one-target hypothesis problem, we show that the density ratio model is the lower bound of the Diverse
Density model.

Definition. For a two-class problem with balanced training data (m = n), if the positive training data
is separable from the negative data in Euclidean space, then we say the training data is well distributed.
Furthermore, if the desired target in positive bags is separable from all negative instances (contained in both
positive and negative bags), we say the multiple instance learning problem is well distributed.

One well-known example is the Gaussian distribution. In order to find the desired target, in other
words the intersection of the positive bags, we hope it obeys a Gaussian distribution and is separable from the
negative instances.

Lemma 1. If the multiple instance learning problem is well distributed, then for all $x \in B^+$ close to
the target t, we have $\varphi(\frac{x - t}{h_n} \mid B_{ij}^+) > \varphi(\frac{x - t}{h_n} \mid B_{ij}^-)$.

This is straightforward from the definition.




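Theorem 2. For the 2n variables $v_i^+, v_i^- \in (0, 1]$, $i = 1, \dots, n$, if $v_i^+ > v_j^-$,
$\forall i, j \in [1, n]$, then we have

$$\frac{\prod_{i=1}^{n} v_i^+}{\prod_{i=1}^{n} v_i^-} \ge \frac{\sum_{i=1}^{n} v_i^+}{\sum_{i=1}^{n} v_i^-}$$

Proof. We only prove this for n = 2; it is easy to extend it to the more general situation. Assume we have
only 4 variables, $v_1^+$, $v_2^+$, $v_1^-$ and $v_2^-$, where $v_i^+ > v_j^-$, $i, j \in [1, 2]$; then we have to prove
$\frac{v_1^+ v_2^+}{v_1^- v_2^-} \ge \frac{v_1^+ + v_2^+}{v_1^- + v_2^-}$.

$$\frac{v_1^+ v_2^+}{v_1^- v_2^-} - \frac{v_1^+ + v_2^+}{v_1^- + v_2^-} \qquad (8)$$

$$= \frac{v_1^+ v_2^+ (v_1^- + v_2^-) - v_1^- v_2^- (v_1^+ + v_2^+)}{v_1^- v_2^- (v_1^- + v_2^-)} \qquad (9)$$

$$= \frac{v_1^+ v_1^- (v_2^+ - v_2^-) + v_2^+ v_2^- (v_1^+ - v_1^-)}{v_1^- v_2^- (v_1^- + v_2^-)} \qquad (10)$$

$$> 0 \qquad (11)$$

□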
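
As a sanity check, the inequality between the ratio of products and the ratio of sums can be probed numerically. The following Python sketch draws random values that satisfy the hypothesis (every positive value exceeds every negative value, all within (0, 1]) and counts violations; the sampling ranges, the tolerance and the function names are arbitrary choices made for this illustration, and a random check is of course not a proof.

```python
import numpy as np

def product_ratio(v_pos, v_neg):
    """Left-hand side of the inequality: ratio of products."""
    return np.prod(v_pos) / np.prod(v_neg)

def sum_ratio(v_pos, v_neg):
    """Right-hand side of the inequality: ratio of sums."""
    return np.sum(v_pos) / np.sum(v_neg)

def count_violations(n, trials=2000, seed=0, tol=1e-12):
    """Count random draws where the product ratio falls below the sum ratio."""
    rng = np.random.default_rng(seed)
    violations = 0
    for _ in range(trials):
        v_neg = rng.uniform(0.01, 0.5, size=n)  # every negative value lies in [0.01, 0.5)
        v_pos = rng.uniform(0.5, 1.0, size=n)   # every positive value lies in [0.5, 1.0)
        if product_ratio(v_pos, v_neg) < sum_ratio(v_pos, v_neg) - tol:
            violations += 1
    return violations
```

For example, with positive values (0.8, 0.6) and negative values (0.4, 0.2), the product ratio is 0.48 / 0.08 = 6, while the sum ratio is 1.4 / 0.6, roughly 2.33.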

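Lemma 3. For a well distributed MIL problem, if the numbers of positive and negative training bags are
balanced, and further $\forall i$,
$\sum_j \varphi(\frac{x - t}{h_n} \mid B_{ij}^+) > \sum_j \varphi(\frac{x - t}{h_n} \mid B_{ij}^-)$, then the
density ratio is the lower bound of the Diverse Density model.

Refer to the Appendices for the proof. Note that the condition $\forall i$,
$\sum_j \varphi(\frac{x - t}{h_n} \mid B_{ij}^+) > \sum_j \varphi(\frac{x - t}{h_n} \mid B_{ij}^-)$ is very weak.
Because there is at least one positive instance in each positive bag, $\exists x \in B_i^+$ close to the target t
such that $\varphi(\frac{x - t}{h_n} \mid B_{ij}^+) \approx 1$, while every $x \in B_i^-$ is far from t, so that
$\varphi(\frac{x - t}{h_n} \mid B_{ij}^-) \approx 0$. From this analysis, we can conclude that the condition is weak.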



3.5 Algorithm 

We summarize the above discussion in pseudocode (Algorithm 1). Because traditional K-means depends on
the initial clusters, we use the more robust K-means++ [29] in our experiments.



3.6 Complexity 

Suppose we have a total of n positive bags and m negative bags, with N total instances. The cost of
our algorithm consists of K-means clustering and computing histograms for both positive bags and negative
bags. The other training steps only operate on a constant number of quantities, so their running time can be
ignored. The complexity of our method is therefore dominated by K-means, which runs in O(NK) time for a
fixed number of iterations and a fixed feature dimension. For large datasets, we randomly sample 10000
instances from the training data, employ K-means++ to partition them into K clusters, and then assign all
other instances to the K centers.
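
The subsample-then-assign strategy can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: plain Lloyd iterations with random initialization stand in for K-means++, and the function names, the iteration count and the seed are hypothetical. The assignment step computes all N x K distances at once, which is fine for a sketch but would need batching for millions of instances.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center for each row of X; costs O(N K d)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def subsampled_kmeans(X, K, sample_size=10000, n_iter=20, seed=0):
    """Fit K centers on a random subsample, then assign every instance to a center."""
    rng = np.random.default_rng(seed)
    m = min(sample_size, len(X))
    sample = X[rng.choice(len(X), size=m, replace=False)]
    # Random initialization (an assumption; the paper uses K-means++ seeding).
    centers = sample[rng.choice(m, size=K, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(sample, centers)
        for k in range(K):
            members = sample[labels == k]
            if len(members) > 0:       # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers, assign(X, centers)
```

Clustering cost then depends on the sample size rather than on N, and the remaining work is a single linear pass that assigns all N instances to their nearest center.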



Algorithm 1

 1: Initialize K, ε, p, q, r;
 2: Partition D = {D_1, ..., D_10};  // 10-fold cross validation
 3: for i = 1; i <= 10; i++ do
 4:     D_t = D - D_i;  // D_t for training, D_i for testing
 5:     Do k-means++ and cluster D_t into K bins;
 6:     Count instances which fall into K bins from B+ using Eq. (3);
 7:     Normalize items in K bins from B+;  // compute histogram for B+
 8:     Count instances which fall into K bins from B- using Eq. (4);
 9:     Normalize items in K bins from B-;  // compute histogram for B-
10:     Compute histogram ratio between B+ and B- according to Eq. (6);
11:     Find the first p targets as positive codebook according to Eq. (6);
12:     Find the first q negative candidate centers as negative codebook according to Eq. (4);
13:     Compute accuracy for D_i by nearest neighbor strategy;
14: end for
15: Return average accuracy;



4 Experiments

We conducted an experimental evaluation on five benchmark datasets, including the traditional MUSK datasets
(Musk1 and Musk2) [1] and image datasets (Tiger, Elephant, and Fox)¹. We also evaluate our method on
a large dataset, the TRECVID MED11 dataset². Ten-fold cross-validation was used and the per-fold average
test classification performance was calculated for evaluation. For parameter settings, in general, K is related
to the dataset size: the larger the dataset, the larger K. r is related to the variance of each cluster after K-means
clustering; p and q are decided by the numbers of positive bags and negative bags. Specifically, we set p ≥ q
for balanced training data, and p < q for unbalanced training data (more negative data). All parameters in
the experiments are determined empirically.

¹ http://www.cs.columbia.edu/~andrews/mil/datasets.html
² http://www.nist.gov/itl/iad/mig/med11.cfm



4.1 Benchmark MIL datasets 

The MUSK datasets have widely served as benchmarks for MIL algorithms. The feature vector
for both Musk1 and Musk2 is 166-dimensional. Musk1 contains a total of 92 bags (47 positive and 45
negative), with approximately 6 instances per bag. Musk2 contains 102 bags (39 positive and 63
negative), with 65 instances per bag on average. The COREL image dataset is 230-dimensional and contains
three object categories: tiger, elephant, and fox. Each of the three categories consists of 200 bags (100 positive
and 100 negative), with about 6 instances per bag. We compare our method with many others, ranging from
classical MIL algorithms (Matlab code³ for DD, EM-DD, Citation-kNN and mi-SVM is available)
to recent models (miGraph⁴, CRF-MIL, GPMIL) in multiple instance learning. Table 1 summarizes the
performance of twelve MIL algorithms in the literature. Five parameters, K, ε, p, q and r, need to be
specified for our method. We set K = 10, p = 2, q = and r = 0.082 for the Musk1 dataset. For the Musk2
dataset, we set K = 22, p = 2, q = 2. To show how K and q influence accuracy, we vary K
and q in turn on the Musk2 dataset. We demonstrate that learning negative codebooks is necessary, while
too large a q decreases accuracy; see Fig. 2 for an intuitive understanding. For the COREL image dataset,
we set K = 20, p = 4, q = 4 for all three objects. Our method yields accuracy that is comparable to most
MIL methods on the benchmark datasets. For the Fox dataset, our highest accuracy is 70%, which is higher
than that of previous methods (see Table 1). The comparison of the different methods demonstrates again that
no single MIL algorithm outperforms the others across all datasets [6].

We also compare the efficiency of the different methods in the same environment (PC + Matlab 2010a).
The efficiency of an algorithm was measured by its running time, i.e., the CPU time
in seconds until the algorithm meets its stopping criterion. Note that in order to speed up the
DD and EM-DD algorithms, we set parameters, such as the iteration count and F-count for the Matlab function
fmincon, smaller than the default values in the source code. We also tested the MIL-Ensemble source code⁵.
Because it is as slow as DD, we do not include its results in Table 2. A significant advantage of our method
is its speed: it is the fastest method on every dataset in Table 2, ranging from roughly 2 times faster than
MI-SVM on Musk1 to more than an order of magnitude faster than Citation kNN, DD and EM-DD. Overall our
method is ranked fifth among those in the evaluation, making it comparable to the other techniques. However,
most of these other techniques run about one order of magnitude more slowly than our method. It is also
important to note that our results are based only on the simple K-means method, whereas other methods use
additional information as well as more sophisticated data costs. We used simple K-means with Euclidean
distance because our focus here is on the algorithmic techniques, and on demonstrating that our method
produces results of similar quality much more quickly.

³ http://www.cs.cmu.edu/~juny/MILL/
⁴ http://lamda.nju.edu.cn/Data.ashx
⁵ http://lamda.nju.edu.cn/code_MIL-Ensemble.ashx

                       Accuracy (%) on benchmark datasets
Method               Musk1   Musk2   Elephant   Fox     Tiger
APR [1]              92.4    89.2    N/A        N/A     N/A
DD [2]               88.9    82.5    N/A        N/A     N/A
EM-DD [10]           84.8    84.9    78.3       56.1    72.1
Citation kNN [9]     92.0    86.3    N/A        N/A     N/A
mi-SVM [5]           87.4    86.3    80.0       57.9    78.9
MICA [30]            84.4    90.5    82.5       62.0    82.0
MI-SVM + DA [14]     85.7    83.8    82.0       63.5    83.0
PPMM Kernel [22]     95.6    81.2    82.4       60.3    80.2
miGraph [16]         90.0    90.0    85.1       61.2    81.9
CRF-MIL [24]         88.0    85.3    85.0       67.5    83.0
GP-MIL [23]          89.5    87.2    83.8       65.7    87.4
GMIL (this paper)    85.0    83.2    81.0       70.0    82.0

Table 1: Accuracy comparison on five standard datasets for different methods. The results of the other methods
are taken from the respective papers. Our method yields comparable results. Note
that, except for Citation kNN [9], the results of all methods are based on 10-fold cross-validation.

Figure 2: (a) Accuracy as a function of the number of clusters K on the Musk2 dataset, with p = 2 and q = 2
fixed. (b) Accuracy as a function of the number of negative centers q on the Musk2 dataset, with K = 22 and
p = 2 fixed. Learning negative codebooks is useful, but too large a q decreases accuracy.
                       Time (sec) for different datasets
Method               Musk1    Musk2    Elephant   Fox      Tiger    Avg. rank   Code
DD                   1725.9   >3600    >3600      >3600    >3600    -           Matlab
EM-DD [10]           1693.3   >3600    >3600      >3600    >3600    -           Matlab
Citation kNN [9]     18.1     2920.9   153.9      139.1    120.5    4           Matlab
MI-SVM               3.2      281.8    22.1       31.4     32.1     2           Matlab/C++
miGraph [16]         21.3     150.1    102.9      102.4    101.4    3           Matlab/C++
CRF-MIL [24]         200.0    N/A      N/A        N/A      N/A      -           Matlab
GMIL (our method)    1.7      13.7     11.9       8.1      9.6      1           Matlab

Table 2: Time required for 10-fold cross validation by the different methods with source code available, except
CRF-MIL. Note that CRF-MIL's running time for Musk1 was taken from the corresponding paper. These
results were obtained on a PC with one Intel Core 2 Duo CPU (2.26 GHz), 4GB RAM and MATLAB
R2010a.




Figure 3: Sample images from the five events in the MED11 dataset: (E001) attempting a board trick; (E002)
feeding an animal; (E003) landing a fish; (E004) wedding ceremony; (E005) working on a woodworking
project.

4.2 Experiments on the TRECVID MED11 

The TRECVID MED11 dataset was used to evaluate our performance on event recognition. We use the first
five events, containing 813 video clips with millions of frames and representing complex activities and events;
see Fig. 3 for sample images. We extract local features using HOG3D [31]. To represent videos using local
features, we apply a common bag-of-words model with a 1000-word codebook. Each frame is represented as
a histogram of occurrences of the codebook elements. Because this dataset requires a large amount of memory,
we run it on a 24-core machine (Intel(R) Xeon(R) CPU X5650 @2.67GHz) with 48GB of memory.



Figure 4: Running time comparison between our method and MI-SVM as a function of the number of bags.
We randomly sample the given number of bags from the 813 total bags and measure the running time of the
two methods respectively.
                       Accuracy (%) comparison on MED11 dataset
Method               E001    E002    E003    E004    E005    Avg. accuracy   Time (days)
MI-SVM [5]           31.20   43^4    28^0    51.64   67M     45.78           > 7
GMIL (this paper)    57.20   47.70   71.90   86.70   76.70   68.04           1 ~ 2

Table 3: Accuracy comparison on five events for different methods (10-fold cross validation) on a 24-core
machine (Intel(R) Xeon(R) CPU X5650 @2.67GHz) with 48GB RAM. Our method
yields competitive results.



In this experiment, we only have video-level labels. Thus, we treat event recognition as a MIL problem.
Each video can be seen as a bag, and its frames as the instances of that bag. To evaluate the
performance of our method, we use MI-SVM as the baseline because a structural SVM (svmlight or liblinear
SVM) requires linear time for training (with a linear kernel). Thus it is faster than other methods; for
example, miGraph has time complexity $O(N^2 (m + n)^2)$. In the experiment, MI-SVM uses liblinear
for training. For evaluation, we take a one-vs-all strategy. For our method, we set K = 50, p = 3, and q = 9
to train on the unbalanced dataset. The results in Table 3 demonstrate that our method yields competitive
results. We also compare the running time of our method and MI-SVM; see Fig. 4 for details.



5 Conclusion 



In this paper, we propose a greedy multiple instance learning method that leverages codebook learning and
nearest neighbor voting. Instead of maximizing a likelihood function, we take a greedy strategy to maximize
the histogram ratio between positive bags and negative bags. By learning codebooks for both targets and
negative centers, we can use them to classify novel bags with a nearest neighbor strategy. The primary
contribution of this paper is to maximize the density ratio to speed up the learning process. Another
contribution is learning both targets and negative candidate centers to reduce false positives. Experimental
results show that our method is significantly faster than the state of the art while remaining comparably
accurate. In future work, we will consider weighting when clustering instances to learn the codebooks. For
example, we can use the Mahalanobis distance as the distance measure in K-means. We also plan to investigate
spectral clustering or kernel k-means methods to learn the codebooks. It is worth mentioning that a more
principled and more sophisticated kernel function could be used for kernel density estimation in order to
improve classification accuracy.
Appendices

Proof of Lemma 3. The Diverse Density model (a noisy-or model) maximizes the following equation:

$$\arg\max_x \Pr(x = t \mid B_1^+, \dots, B_n^+)\, \Pr(x = t \mid B_1^-, \dots, B_m^-)$$

$$= \arg\max_x \prod_{i=1}^{n} \Pr(x = t \mid B_i^+) \prod_{i=1}^{m} \Pr(x = t \mid B_i^-)$$

$$= \arg\max_x \underbrace{\prod_{i=1}^{n} \Big(1 - \prod_j \big(1 - \Pr(x = t \mid B_{ij}^+)\big)\Big)}_{B^+ \text{ likelihood term}} \times \underbrace{\prod_{i=1}^{m} \prod_j \big(1 - \Pr(x = t \mid B_{ij}^-)\big)}_{B^- \text{ likelihood term}} \qquad (12)$$

Note that the left likelihood term and the right term share a common term $(1 - \Pr(x = t \mid B_{ij}^*))$ (the
symbol * can be + or -), except that one instance is from positive training bags, while the other instance is
from negative training bags. Thus, to maximize Eq. (12) means that we need to minimize
$\prod_j (1 - \Pr(x = t \mid B_{ij}^+))$ and maximize $\prod_j (1 - \Pr(x = t \mid B_{ij}^-))$ simultaneously.
Equivalently, we can further minimize the following ratio:

$$\arg\min_x \frac{\prod_{i=1}^{n} \prod_j \big(1 - \Pr(x = t \mid B_{ij}^+)\big)}{\prod_{i=1}^{m} \prod_j \big(1 - \Pr(x = t \mid B_{ij}^-)\big)}$$

$$= \arg\max_x \frac{\prod_{i=1}^{n} \prod_j \Pr(x = t \mid B_{ij}^+)}{\prod_{i=1}^{m} \prod_j \Pr(x = t \mid B_{ij}^-)} \qquad (13)$$

$$\ge \arg\max_x \frac{\sum_{i=1}^{n} \sum_j \Pr(x = t \mid B_{ij}^+)}{\sum_{i=1}^{m} \sum_j \Pr(x = t \mid B_{ij}^-)}$$

The last two equations can be derived from Theorem 2. Thus, the density ratio Eq. (1) is the lower bound
of the Diverse Density model. Intuitively, if one of the instances in a positive bag is close to x, then
$\Pr(x = t \mid B_{ij}^+)$ is highly peaked, and Eq. (1) will have a high value, which guarantees high Diverse
Density in Eq. (12). Further, if every positive bag has an instance close to x and no negative bags are close
to x, then x will have a high value in Eq. (1), as well as high Diverse Density in Eq. (12). In other words,
maximizing the density ratio directly maximizes the DD model. □
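
For reference, the noisy-or objective of Eq. (12) can be evaluated directly with a short Python sketch. It assumes the Gaussian-like instance model $\Pr(x = t \mid B_{ij}) = \exp(-\|B_{ij} - x\|^2)$ used in the Diverse Density formulation of Maron and Lozano-Perez [2], without feature scaling; the function names are illustrative.

```python
import numpy as np

def instance_prob(x, bag):
    """Pr(x = t | B_ij) for every instance of a bag, modeled as exp(-||B_ij - x||^2)."""
    return np.exp(-((bag - x) ** 2).sum(axis=1))

def diverse_density(x, pos_bags, neg_bags):
    """Noisy-or Diverse Density of a candidate concept point x (Eq. 12)."""
    dd = 1.0
    for bag in pos_bags:
        # At least one instance of each positive bag should be close to x.
        dd *= 1.0 - np.prod(1.0 - instance_prob(x, bag))
    for bag in neg_bags:
        # Every instance of each negative bag should be far from x.
        dd *= np.prod(1.0 - instance_prob(x, bag))
    return dd
```

Evaluating this function at each candidate point requires a pass over all instances, which is the cost that the histogram ratio avoids by scoring K bins once.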